Deep indexed active learning for matching heterogeneous entity representations

Abstract

Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks, making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space, leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful transformer models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time. Code is available at https://github.com/ArjitJ/DIAL
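The blocking idea in the abstract can be illustrated with a small sketch. This is not DIAL's implementation: the `embed` function below is a hypothetical stand-in (hashed character trigrams) for the learned transformer encoders, the `seeds` are an assumed way to give each committee member a different representation, and the record strings are made-up examples. It only shows how taking the union of each member's top-k nearest neighbours yields far fewer candidate pairs than the full Cartesian product.

```python
import math
import zlib
from itertools import product


def embed(text, seed, dim=64):
    """Toy stand-in for a learned encoder: hashed character-trigram counts."""
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        # zlib.crc32 is deterministic across runs, unlike the built-in hash().
        bucket = zlib.crc32(f"{seed}:{gram}".encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def top_k_neighbors(query_vec, candidate_vecs, k):
    """Indices of the k candidates with the highest cosine similarity."""
    scores = [
        (sum(q * c for q, c in zip(query_vec, cand)), idx)
        for idx, cand in enumerate(candidate_vecs)
    ]
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:k]]


def block_by_committee(left, right, seeds=(0, 1, 2), k=1):
    """Union of every member's top-k retrievals, as a set of (i, j) pairs."""
    candidates = set()
    for seed in seeds:  # one hypothetical committee member per seed
        right_vecs = [embed(r, seed) for r in right]
        for i, record in enumerate(left):
            for j in top_k_neighbors(embed(record, seed), right_vecs, k):
                candidates.add((i, j))
    return candidates


# Made-up records, for illustration only.
left_records = [
    "Apple iPhone 12 64GB",
    "Sony WH-1000XM4 headphones",
    "Dell XPS 13 laptop",
]
right_records = [
    "iphone 12 (64 gb) by apple",
    "dell xps 13 notebook",
    "sony wh1000xm4 wireless headphones",
    "canon eos camera",
]

all_pairs = len(list(product(left_records, right_records)))
blocked_pairs = block_by_committee(left_records, right_records)
print(f"Cartesian product: {all_pairs} pairs, after blocking: {len(blocked_pairs)} pairs")
```

In an active learning loop, only the blocked pairs would be scored by the matcher and considered for labeling, which is what keeps the quadratic search space manageable.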

Related resources

Deep Active Learning for Named Entity Recognition

Deep neural networks have advanced the state of the art in named entity recognition. However, under typical training procedures, advantages over classical methods emerge only with large datasets. As a result, deep learning is employed only when large public datasets or a large budget for manually labeling data is available. In this work, we show that by combining deep learning with active learn...

Entity Matching across Heterogeneous Sources

Given an entity in a source domain, finding its matched entities from another (target) domain is an important task in many applications. Traditionally, the problem was usually addressed by first extracting major keywords corresponding to the source entity and then querying relevant entities from the target domain using those keywords. However, the method would inevitably fail if the two domains h...

Named Entity Recognition in Persian Text using Deep Learning

Named entity recognition is a fundamental task in the field of natural language processing. It is also known as a subtask of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...

Learning Deep Parsimonious Representations

In this paper we aim at facilitating generalization for deep networks while supporting interpretability of the learned representations. Towards this goal, we propose a clustering based regularization that encourages parsimonious representations. Our k-means style objective is easy to optimize and flexible, supporting various forms of clustering, such as sample clustering, spatial clustering, as...

Deep Learning of Representations

Unsupervised learning of representations has been found useful in many applications and benefits from several advantages, e.g., where there are many unlabeled examples and few labeled ones (semi-supervised learning), or where the unlabeled or labeled examples are from a distribution different but related to the one of interest (self-taught learning, multi-task learning, and domain adaptation). ...

Journal

Journal title: Proceedings of the VLDB Endowment

Year: 2021

ISSN: 2150-8097

DOI: https://doi.org/10.14778/3485450.3485455